Effect of Resting-State fNIRS Scanning Duration on Functional Brain Connectivity and Graph Theory Metrics of Brain Network
As an emerging brain imaging technique, functional near-infrared spectroscopy (fNIRS) has attracted widespread attention for advancing resting-state functional connectivity (FC) and graph theoretical analyses of brain networks. However, it remains largely unknown how the duration of fNIRS scanning relates to stable and reproducible functional brain network features. To answer this question, we collected resting-state fNIRS signals (10-min duration, two runs) from 18 participants and then truncated the hemodynamic time series in 30-s increments into segments ranging from 1 to 10 min. Measures of nodal efficiency, nodal betweenness, network local efficiency, global efficiency, and clustering coefficient were computed for each subject at each acquisition duration. Stability and between-run reproducibility analyses were performed to identify the optimal scanning duration for each measure. We found that FC, nodal efficiency and nodal betweenness stabilized and were reproducible after 1 min of fNIRS signal acquisition, whereas the network clustering coefficient and the local and global efficiencies stabilized after 1 min, with only the local and global efficiencies becoming reproducible after 5 min of acquisition. These quantitative results provide direct evidence for choosing a resting-state fNIRS scanning duration that yields stable functional brain connectivity and topological metrics of brain networks.
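For readers unfamiliar with the graph metrics named above, here is a minimal pure-Python sketch of two of them (global efficiency and the mean clustering coefficient) on a toy unweighted graph. This is purely illustrative: real fNIRS analyses compute these on thresholded FC matrices, and the adjacency structure below is invented for the example.

```python
from collections import deque

def shortest_path_lengths(adj, src):
    """BFS shortest-path lengths from src in an unweighted graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def global_efficiency(adj):
    """Mean inverse shortest-path length over all ordered node pairs."""
    n = len(adj)
    total = 0.0
    for u in adj:
        dist = shortest_path_lengths(adj, u)
        total += sum(1.0 / d for v, d in dist.items() if v != u)
    return total / (n * (n - 1))

def clustering_coefficient(adj):
    """Mean local clustering coefficient (fraction of closed triangles)."""
    coeffs = []
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        coeffs.append(2.0 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

# Toy 4-node graph: a triangle (0, 1, 2) plus a pendant node 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

On this toy graph the global efficiency is 10/12 and the mean clustering coefficient 7/12; in practice one would compute them per subject and per scanning duration, as the study does.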
Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition
Automatic recognition of disordered speech remains a highly challenging task
to date. The underlying neuro-motor conditions, often compounded with
co-occurring physical disabilities, lead to the difficulty in collecting large
quantities of impaired speech required for ASR system development. This paper
presents novel variational auto-encoder generative adversarial network
(VAE-GAN) based personalized disordered speech augmentation approaches that
simultaneously learn to encode, generate and discriminate synthesized impaired
speech. Separate latent features are derived to learn dysarthric speech
characteristics and phoneme context representations. Self-supervised
pre-trained Wav2vec 2.0 embedding features are also incorporated. Experiments
conducted on the UASpeech corpus suggest that the proposed adversarial data
augmentation approach consistently outperformed the baseline speed perturbation
and non-VAE GAN augmentation methods when training hybrid TDNN and end-to-end
Conformer systems. After LHUC speaker adaptation, the best system using VAE-GAN
based augmentation produced an overall WER of 27.78% on the UASpeech test set
of 16 dysarthric speakers, and the lowest published WER of 57.31% on the subset
of speakers with "Very Low" intelligibility.
Comment: Submitted to ICASSP 202
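The WER figures quoted above follow the standard word error rate definition; a minimal sketch (not the authors' scoring tool) computing it via Levenshtein distance over word sequences:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as Levenshtein edit distance over word sequences."""
    r, h = ref.split(), hyp.split()
    # dp[i][j]: edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(r)][len(h)] / len(r)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` is 2/6: one substitution and one deletion against a six-word reference.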
Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition
Accurate recognition of cocktail party speech containing overlapping
speakers, noise and reverberation remains a highly challenging task to date.
Motivated by the invariance of visual modality to acoustic signal corruption,
an audio-visual multi-channel speech separation, dereverberation and
recognition approach featuring a full incorporation of visual information into
all system components is proposed in this paper. The efficacy of the video
input is consistently demonstrated in mask-based MVDR speech separation,
DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end and
Conformer ASR back-end. Audio-visual integrated front-end architectures
performing speech separation and dereverberation in a pipelined or joint
fashion via mask-based WPD are investigated. The error cost mismatch between
the speech enhancement front-end and ASR back-end components is minimized by
end-to-end joint fine-tuning using either the ASR cost function alone, or its
interpolation with the speech enhancement loss. Experiments were conducted on
the mixture overlapped and reverberant speech data constructed using simulation
or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel
speech separation, dereverberation and recognition systems consistently
outperformed the comparable audio-only baseline by 9.1% and 6.2% absolute
(41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech
enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processing
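Absolute and relative WER reductions such as the 9.1% / 41.7% pair above are related by a simple identity; a small sketch, where the baseline and system WERs are hypothetical values chosen only to roughly reproduce that pair:

```python
def wer_reductions(baseline_wer, system_wer):
    """Absolute and relative WER reduction of a system over a baseline.
    WERs are given as percentages, e.g. 21.8 for 21.8%."""
    absolute = baseline_wer - system_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative

# Hypothetical numbers: a 21.8% baseline improved to 12.7%.
abs_red, rel_red = wer_reductions(21.8, 12.7)
```

Here the absolute reduction is 9.1 percentage points and the relative reduction about 41.7%, matching the form of the figures reported in the abstract.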
Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition
Automatic recognition of disordered and elderly speech remains a highly
challenging task to date due to the difficulty in collecting such data in large
quantities. This paper explores a series of approaches to integrate domain
adapted SSL pre-trained models into TDNN and Conformer ASR systems for
dysarthric and elderly speech recognition: a) input feature fusion between
standard acoustic frontends and domain adapted wav2vec2.0 speech
representations; b) frame-level joint decoding of TDNN systems separately
trained using standard acoustic features alone and with additional wav2vec2.0
features; and c) multi-pass decoding, in which the TDNN/Conformer system
outputs are rescored using domain adapted wav2vec2.0 models. In addition,
domain adapted wav2vec2.0 representations are utilized in
acoustic-to-articulatory (A2A) inversion to construct multi-modal dysarthric
and elderly speech recognition systems. Experiments conducted on the UASpeech
dysarthric and DementiaBank Pitt elderly speech corpora suggest that TDNN and
Conformer ASR systems integrating domain adapted wav2vec2.0 models consistently
outperform the standalone wav2vec2.0 models by statistically significant WER
reductions of 8.22% and 3.43% absolute (26.71% and 15.88% relative) on the two
tasks respectively. The lowest published WERs of 22.56% (52.53% on very low
intelligibility, 39.09% on unseen words) and 18.17% are obtained on the
UASpeech test set of 16 dysarthric speakers, and the DementiaBank Pitt test set
respectively.
Comment: accepted by ICASSP 202
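Input feature fusion as in approach (a) amounts to frame-level concatenation of a standard acoustic frontend with SSL embeddings; a toy sketch, where the frame rates, dimensions and the frame-repetition alignment scheme are all illustrative assumptions rather than the paper's exact configuration:

```python
def fuse_features(fbank, ssl, ssl_stride=2):
    """Concatenate per-frame acoustic features (e.g. 100 fps fbank) with
    SSL embeddings at a coarser rate (e.g. 50 fps wav2vec2.0), repeating
    each SSL frame ssl_stride times to align the two streams."""
    fused = []
    for t, frame in enumerate(fbank):
        emb = ssl[min(t // ssl_stride, len(ssl) - 1)]
        fused.append(frame + emb)  # list concatenation = feature concat
    return fused

# Toy example: 4 fbank frames of dim 2, 2 SSL frames of dim 3.
fbank = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
ssl = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
fused = fuse_features(fbank, ssl)
```

Each fused frame then has dimension 2 + 3 = 5, and downstream acoustic models consume the concatenated stream as ordinary input features.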
Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems
Fundamental modelling differences between hybrid and end-to-end (E2E)
automatic speech recognition (ASR) systems create large diversity and
complementarity among them. This paper investigates multi-pass rescoring and
cross adaptation based system combination approaches for hybrid TDNN and
Conformer E2E ASR systems. In multi-pass rescoring, a state-of-the-art hybrid
LF-MMI trained CNN-TDNN system featuring speed perturbation, SpecAugment and
Bayesian learning hidden unit contributions (LHUC) speaker adaptation was used
to produce initial N-best outputs before being rescored by the speaker adapted
Conformer system using a 2-way cross system score interpolation. In cross
adaptation, the hybrid CNN-TDNN system was adapted to the 1-best output of the
Conformer system or vice versa. Experiments on the 300-hour Switchboard corpus
suggest that the combined systems derived using either of the two system
combination approaches outperformed the individual systems. The best combined
system obtained using multi-pass rescoring produced statistically significant
word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9%
relative) over the stand-alone Conformer system on the NIST Hub5'00, Rt03 and
Rt02 evaluation data.
Comment: Accepted to ISCA 202
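The 2-way cross-system score interpolation used in multi-pass rescoring can be sketched as a weighted combination of per-hypothesis scores over an N-best list; the hypotheses, scores and interpolation weight below are hypothetical:

```python
def rescore_nbest(nbest, lam=0.5):
    """Two-way cross-system score interpolation over an N-best list.
    Each entry: (hypothesis, hybrid_score, e2e_score); scores are
    log-probabilities (higher is better). lam weights the hybrid system."""
    best_hyp, best_score = None, float("-inf")
    for hyp, s_hybrid, s_e2e in nbest:
        score = lam * s_hybrid + (1.0 - lam) * s_e2e
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp

# Hypothetical 3-best list with scores from the two systems.
nbest = [
    ("the cat sat", -4.0, -6.0),
    ("the cat sad", -5.0, -3.0),
    ("a cat sat",   -7.0, -7.0),
]
```

With `lam=1.0` only the hybrid system's ranking is used; intermediate values let the two systems' complementary errors cancel, which is the motivation for combination in the paper.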
Intersecting distributed networks support convergent linguistic functioning across different languages in bilinguals
How bilingual brains accomplish the processing of more than one language has been widely investigated by neuroimaging studies. The assimilation-accommodation hypothesis holds that both the same brain neural networks supporting the native language and additional new neural networks are utilized to implement second language processing. However, whether and how this hypothesis applies at the finer-grained levels of both brain anatomical organization and linguistic function remains unknown. To address this issue, we scanned Chinese-English bilinguals during an implicit reading task involving Chinese words, English words and Chinese pinyin. We observed broad cortical regions wherein interdigitated, distributed neural populations supported the same cognitive components of different languages. Although spatially separate, regions including the opercular and triangular parts of the inferior frontal gyrus, temporal pole, superior and middle temporal gyrus, precentral gyrus and supplementary motor areas were found to perform the same linguistic functions across languages, indicating regional-level functional assimilation supported by voxel-wise anatomical accommodation. Taken together, the findings not only verify the functional independence of the neural representations of different languages, but also show a co-representation organization of both languages in most language regions, revealing linguistic-feature-specific accommodation and assimilation between the first and second languages.
Who can help me? Understanding the antecedent and consequence of medical information seeking behavior in the era of bigdata
Introduction: The advent of the big data era has fundamentally transformed the nature of medical information seeking and the traditional binary medical relationship. Weaving together stress coping theory and information processing theory, we developed an integrative perspective on information seeking behavior and explored the antecedents and consequences of such behavior.
Methods: Data were collected from 573 women suffering from infertility who were seeking assisted reproductive technology treatment in China. We used AMOS 22.0 and the PROCESS macro in SPSS 25.0 to test our model.
Results: Our findings demonstrated that patients' satisfaction with information received from physicians negatively predicted their involvement in information seeking; such behavior was positively related to their perceived information overload, and the latter was negatively related to patient-physician relationship quality. Further findings showed that medical information seeking behavior and perceived information overload serially mediated the impact of satisfaction with information received from physicians on patient-physician relationship quality.
Discussion: This study extends knowledge of information seeking behavior by proposing an integrative model and expands the application of stress coping theory and information processing theory. Additionally, it provides valuable implications for patients, physicians and public health information service providers.
Finishing the euchromatic sequence of the human genome
The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers ∼99% of the euchromatic genome and is accurate to an error rate of ∼1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome, including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.
Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition
Articulatory features are inherently invariant to acoustic signal distortion
and have been successfully incorporated into automatic speech recognition (ASR)
systems designed for normal speech. Their practical application to atypical
task domains such as elderly and disordered speech across languages is often
limited by the difficulty in collecting such specialist data from target
speakers. This paper presents a cross-domain and cross-lingual A2A inversion
approach that utilizes the parallel audio, visual and ultrasound tongue imaging
(UTI) data of the 24-hour TaL corpus in A2A model pre-training before being
cross-domain and cross-lingual adapted to three datasets across two languages:
the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora;
and the English TORGO dysarthric speech data, to produce UTI based articulatory
features. Experiments conducted on three tasks suggested that systems
incorporating the generated articulatory features consistently outperformed the
baseline hybrid TDNN and Conformer based end-to-end systems constructed using
acoustic features only, with statistically significant word error rate or
character error rate reductions of up to 2.64%, 1.92% and 1.21% absolute
(8.17%, 7.89% and 13.28% relative) after data augmentation and speaker
adaptation were applied.
Comment: arXiv admin note: text overlap with arXiv:2203.1027